ABSTRACT

Frequent pattern mining is one of the most active and interesting fields in data mining. Finding frequent itemsets has
a wide range of applications, such as sensor networks, web click-stream data, and intrusion detection. A data stream is a
continuous, rapid, unbounded sequence of data. Mining frequent patterns in stream data is challenging because the data can
be scanned only once; for this reason, traditional approaches cannot be used on data streams. Frequent pattern mining also
generates an enormous number of frequent patterns, and producing all of them is not practical. Finding only the top-k
patterns whose utility is above a threshold is more attractive. In addition, considering a weight factor together with
support is a more realistic approach. Our algorithm efficiently finds potential top-k high utility patterns and
incorporates an effective pruning mechanism. Our experimental results show that it outperforms previous algorithms in
terms of runtime and memory usage.

KEYWORDS: Data Stream, High Utility Pattern Mining, Sliding Window, Frequent Itemset, Top K Pattern Mining 



INTRODUCTION

Data mining is a technique for extracting hidden, useful information from large databases. Here, the useful patterns
are frequent patterns: itemsets that satisfy a minimum support threshold are called frequent patterns. Frequent patterns
are valuable because they indicate how useful itemsets are. Many algorithms, such as Apriori, FP-growth, and Eclat, can
efficiently discover frequent patterns and trends in a database.

Nowadays, many organizations and sources, such as social websites and sensor networks, generate huge amounts of data
at very high speed. Data that is rapid, unbounded, and continuous in nature is called a data stream. Data stream mining
is a challenging and very active field in the data mining community.

Traditional mining techniques work best on static databases. Mining frequent patterns from a dynamic database is
harder than mining them from a static one. The essential requirements for data stream mining are: 1] The data can be
scanned only once. 2] Data arrives continuously, so processing should be as fast as possible. 3] Data streams are very
large and keep arriving while memory is limited, so the algorithm should use limited or constant memory. 4] The response
time to a user query should be minimal. Mining a data stream can generate a huge number of frequent patterns, but storage
and processing capability are limited. Therefore, instead of generating all frequent patterns, it is useful to generate
closed or maximal frequent patterns, because they represent the frequent patterns in a more compact form. An itemset is
closed if it has no superset with the same support. An itemset is maximal if it has no frequent superset.

In the real world, each item has its own importance, such as a price or a weight. Some itemsets are less frequent
but generate higher profit, and itemsets with high profit are important for business. Mining such itemsets is known as
utility mining.

There are three data stream processing models: landmark, damped, and sliding window. The landmark model mines all
frequent itemsets over the entire history of the stream data, from a specific time point called the landmark to the
present. The damped model also mines frequent patterns, but each transaction has a weight that decreases with time; it is
therefore also known as the time-fading model. The sliding window model mines frequent itemsets over stream data by
temporarily storing part of the data, processing it, and then discarding it as the window moves ahead to new data. In
general, the sliding window model is used when recent information is more important than historical information.
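
The sliding window model described above can be sketched in a few lines of Python. This is a minimal illustration, not
the implementation used in this paper; the function name `sliding_windows` and the toy stream are hypothetical.

```python
from collections import deque


def sliding_windows(batches, win_size):
    """Yield the current window (a list of batches) after each new batch.

    The oldest batch is dropped automatically once the window is full.
    """
    window = deque(maxlen=win_size)
    for batch in batches:
        window.append(batch)
        if len(window) == win_size:
            yield list(window)


# Hypothetical stream of four batches, mined with a window of two batches.
stream = [["T1", "T2"], ["T3"], ["T4", "T5"], ["T6"]]
windows = list(sliding_windows(stream, win_size=2))
```

Each yielded window contains only the most recent `win_size` batches, so older transactions stop influencing the result
as soon as they leave the window.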

The rest of this paper is organized as follows. Section 2 discusses related research. Section 3 describes
preliminaries and definitions. Section 4 introduces the proposed framework, ICFP. Section 5 presents the performance of
ICFP. Finally, Section 6 discusses conclusions and future work.

RELATED WORK 

Frequent pattern mining started with static databases. Among traditional algorithms, Apriori
(Agrawal & Srikant, 1994) and FP-growth (Han et al., 2004) are the pioneering algorithms. Apriori uses breadth-first
search (BFS), generates numerous candidate patterns during mining, and needs to scan the database repeatedly. FP-growth is
based on depth-first search (DFS), which improves the mining process: it overcomes the problem of repeated scans and is
able to mine patterns with two database scans. To date, most algorithms are based on FP-growth.

These traditional algorithms are not suitable for a data stream environment. Although FP-growth is able to mine
patterns with two scans, it is not suitable for data streams, since data in a stream can be scanned only once. Based on
this fact, three processing models have been proposed.

Sliding Window Based Frequent Pattern Mining Over Data Stream 

The sliding window model considers recent data more important, and many approaches based on it
(Ahmed et al., 2009; Chen et al., 2012; Deypir et al., 2012; Farzanyar et al., 2012; Li, 2011; Mozafari et al., 2008;
Shie et al., 2012; Tanbeer et al., 2009b; Zhang & Zhang, 2011) have been proposed. Tanbeer et al. (2009a, 2009b) proposed
an efficient approach that uses a tree restructuring technique, BSM, which performs the restructuring operation more
effectively through a path adjusting method. The estWin algorithm, proposed by Chang and Lee (2005), finds recent frequent
patterns and uses a reduced minimum support threshold, named significance, for early monitoring of itemsets before they
actually become frequent. The Moment algorithm (Chi et al., 2006) finds closed frequent patterns by maintaining a boundary
between frequent closed itemsets and other itemsets.

Concept Drift 

Koh and Lin (2009) proposed an innovative approach that continuously monitors the incoming transactions to detect
the occurrence of a concept shift. Frequent itemsets are mined when a concept shift is observed. Nori et al. (2013)
proposed an approach based on concept drift, named TMoment, which uses a sliding window.

Weighted Condition in Frequent Pattern Mining Over Data Stream

In the real world, each item has its own importance, known as its weight or utility (price or profit). Not only
support but also weight plays a crucial role in the mining process. The main challenge is to maintain the anti-monotone
property: a superset of a weighted infrequent pattern can be weighted frequent, which normally destroys that property.
Many researchers have been able to maintain this property by a variety of methods (Ahmed et al., 2009, 2012;
Wang & Zeng, 2011; Yun & Ryu, 2011; Yun et al., 2011, 2012). Chang and Lee (2006) introduced an algorithm called estDec,
based on the time decay model, in which each transaction has a weight that decreases with age. In this method, a decay
rate is defined in order to reduce the effect of old transactions on the set of frequent patterns. UP-Growth
(Tseng et al., 2010) is a novel algorithm that uses various pruning and counting strategies during the mining process.
The IWFP algorithm (Ahmed et al., 2012) performs weighted frequent pattern mining by applying the BSM method. THUI-Mine
(Tseng et al., 2006) is the first algorithm for mining high utility itemsets. WFPMDS (Ahmed et al., 2009) mines weighted
frequent patterns using a sliding window; it produces recent mining results from the window and restructures its tree
with the BSM technique.

In this study, the framework of the proposed algorithm, IFP, is based on the T-HUDs mining algorithm.

PRELIMINARIES

A data stream $D = \{B_1, B_2, \ldots, B_N\}$ is an infinite sequence of batches, where each batch $B_i$ contains a
set of transactions, i.e. $B_i = \{T_1, T_2, \ldots, T_K\}$ with $K > 0$. Each transaction
$T = \{(i_1, q_1), (i_2, q_2), \ldots, (i_n, q_n)\}$ is a set of items, where $i$ represents an item such that $i \in I$
and $q$ represents the quantity of that item in the transaction. An itemset is a non-empty set of items. An itemset of
size $k$ is called a $k$-itemset. A sliding window $W$ consists of a number of batches,
$W = \{B_1, B_2, \ldots, B_m\}$.

Definition 1. The utility of an item $i$ in a transaction $T_j$ is defined as $u(i, T_j) = q(i, T_j) \times p(i)$,
where $q(i, T_j)$ is the quantity of item $i$ in $T_j$ and $p(i)$ is the external utility of item $i$.

Definition 2. The utility of an itemset $X$ in a transaction $T_j$ is defined by $u(X, T_j) = \sum_{i \in X} u(i, T_j)$.
For example, $u(\{bc\}, T_3) = 2 \times 6 + 3 \times 5 = 27$ in fig 1.

Definition 3. The utility of an itemset $X$ in a data set $D$ is defined as

$$u_D(X) = \sum_{X \subseteq T_j \wedge T_j \in D} \; \sum_{i \in X} u(i, T_j).$$

We use $u(X)$ to denote $u_D(X)$ when the data set $D$ is clear from the context.

Definition 4. The utility of a transaction $T_j$ is denoted as $TU(T_j)$ and computed as $u(T_j, T_j)$.

Definition 5. (High Utility Itemset (HUI)) An itemset $X$ is called a high utility itemset (HUI) on a data set $D$ if
and only if $u_D(X) \geq min\_util$, where $min\_util$ is called the minimum utility threshold.

Definition 6. (Transaction-Weighted Utility (TWU)) The TWU of an itemset $X$ over a data set $D$ is defined as
$TWU_D(X) = \sum_{X \subseteq T_j \wedge T_j \in D} TU(T_j)$.

Definition 7. The prefix utility of an itemset $X$ in a transaction $T$ is defined as
$PrefixUtil(X, T) = \sum_{i \in PrefixSet(X, T)} u(i, T)$, where $PrefixSet(X, T)$ denotes the items of $T$ up to and
including the last item of $X$ in the item order.
For example, $PrefixUtil(\{ac\}, T_3) = u(a, T_3) + u(b, T_3) + u(c, T_3) = 3 + 12 + 15 = 30$.

Definition 8. The prefix utility of an itemset $X$ in a data set $D$ is defined as
$PrefixUtil_D(X) = \sum_{X \subseteq T_j \wedge T_j \in D} PrefixUtil(X, T_j)$.
For example, $PrefixUtil(\{ac\}) = PrefixUtil(\{ac\}, T_1) + PrefixUtil(\{ac\}, T_2) + PrefixUtil(\{ac\}, T_3) = 8 + 36 + 30 = 74$.

Property 1. For any itemset $X$ in a data set $D$, the following relationship holds:

$$TWU_D(X) \geq PrefixUtil_D(X) \geq u_D(X).$$

Clearly, $TWU_D(X) \geq u_D(X)$.

TWU satisfies the downward closure property, that is, for all $Y \subseteq X$, $TWU_D(Y) \geq TWU_D(X)$. Most HUI
mining methods use the TWU values of itemsets to prune the search space; that is, they find all itemsets whose TWU is no
less than the minimum utility threshold. Since $TWU_D(X)$ is an overestimate of $u_D(X)$, the procedure does not miss any
high utility itemset.
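
The overestimate and the downward closure of TWU can be checked numerically. The following minimal Python sketch uses
the quantities of Table 1 and the external utilities of Table 2; the function names `tu`, `utility`, and `twu` are
assumptions made for this illustration and are not part of the proposed algorithm.

```python
# External utilities (Table 2) and transactions with quantities (Table 1).
price = {"A": 5, "B": 3, "C": 6, "D": 1, "E": 2, "F": 1}
db = [
    {"A": 2, "B": 1, "C": 3, "E": 2},
    {"A": 1, "D": 1, "F": 2},
    {"B": 2, "C": 1, "D": 3, "E": 3},
    {"C": 3, "F": 2},
    {"A": 2, "B": 1, "D": 1, "E": 2, "F": 2},
    {"A": 2, "B": 3, "C": 2},
]


def contains(t, x):
    """True if transaction t contains every item of itemset x."""
    return all(i in t for i in x)


def tu(t):
    """Transaction utility TU(T): utility of all items in T."""
    return sum(q * price[i] for i, q in t.items())


def utility(x):
    """u_D(X): utility of X summed over the transactions that contain X."""
    return sum(sum(t[i] * price[i] for i in x) for t in db if contains(t, x))


def twu(x):
    """TWU_D(X): sum of TU(T) over the transactions that contain X."""
    return sum(tu(t) for t in db if contains(t, x))


ab, abc = ("A", "B"), ("A", "B", "C")
```

Under these assumptions, `twu(ab)` is 86 while `utility(ab)` is 45, and `twu(abc)` is 66 while `utility(abc)` is 62, so
TWU overestimates the utility in both cases and shrinks when the itemset grows. With a minimum utility threshold of, say,
70, the itemset {A, B, C} and all of its supersets could be pruned on the basis of TWU alone.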

Property 2. Anti-monotone Property

The support of an itemset never exceeds the support of any of its subsets. For example, if $x$ and $y$ are two
itemsets such that $x \cap y = \emptyset$, then $supp(x \cup y) \leq supp(x)$ and $supp(x \cup y) \leq supp(y)$. This
makes the search in the itemset lattice easier by avoiding a large number of useless cases.

Property 3. Apriori Property

Every subset of a frequent itemset is also frequent.



Table 1: Dataset

TID   ITEMSET                                TU
1     (A,2), (B,1), (C,3), (E,2)             35
2     (A,1), (D,1), (F,2)                    8
3     (B,2), (C,1), (D,3), (E,3)             21
4     (C,3), (F,2)                           20
5     (A,2), (B,1), (D,1), (E,2), (F,2)      20
6     (A,2), (B,3), (C,2)                    31



Table 2: Items with Their External Utility

Item name          A   B   C   D   E   F
External Utility   5   3   6   1   2   1
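
As a worked check of Definitions 1, 2, and 4 against the data above, the following minimal Python sketch recomputes the
TU column of Table 1 from the quantities in Table 1 and the external utilities in Table 2. The function and variable
names are assumptions made for this illustration.

```python
# Table 2: external utility of each item.
external_utility = {"A": 5, "B": 3, "C": 6, "D": 1, "E": 2, "F": 1}

# Table 1: TID -> list of (item, quantity) pairs.
table1 = {
    1: [("A", 2), ("B", 1), ("C", 3), ("E", 2)],
    2: [("A", 1), ("D", 1), ("F", 2)],
    3: [("B", 2), ("C", 1), ("D", 3), ("E", 3)],
    4: [("C", 3), ("F", 2)],
    5: [("A", 2), ("B", 1), ("D", 1), ("E", 2), ("F", 2)],
    6: [("A", 2), ("B", 3), ("C", 2)],
}


def itemset_utility(itemset, tid):
    """u(X, T_j): quantity * external utility summed over the items of X.

    Assumes that every item of X occurs in transaction T_j.
    """
    return sum(q * external_utility[i] for i, q in table1[tid] if i in itemset)


def transaction_utility(tid):
    """TU(T_j) = u(T_j, T_j): utility of every item in the transaction."""
    return sum(q * external_utility[i] for i, q in table1[tid])


tu_column = [transaction_utility(tid) for tid in sorted(table1)]
```

For instance, `itemset_utility({"B", "C"}, 3)` evaluates to 2 x 3 + 1 x 6 = 12, and `tu_column` reproduces the TU
values 35, 8, 21, 20, 20, and 31 listed in Table 1.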



METHODOLOGY OF IFP 

A tree is a very useful structure for storing and processing data, but the more data it stores, the more memory it
consumes. Researchers mostly use tree structures based on the FP-growth tree model. Each research paper differs slightly
in how it uses and processes transactions and maintains the tree structure.

How transactions are processed is the more important issue. From a data stream perspective, data arrives at very
high speed, so processing must be fast and each transaction must be scanned only once, when it arrives. The more tasks an
algorithm performs, the more time it takes and the more accurate its results are, so an algorithm has to balance accuracy
against speed.

Many research papers suggest that resource-aware mining is very useful. Because a data stream contains an unbounded
number of transactions, resource usage must be kept under control; otherwise overloading hampers processing and leads to
faulty results.

As discussed in the Introduction, there are three window models. Among them, the sliding window model is the
preferred choice when recent transactions are more important. With the sliding window model, the data are mined based on
recent data, and the current mining result is more important than older ones.

DFS and BFS are the two techniques that most data mining researchers use. Each technique is best in certain
scenarios.

Here we use the two structures shown below.

Table 3: RankItem Node Structure

RankItem Node
No.   Field                 Description
1     Name                  Item name
2     UnitList[Win_Size]    Stores the utility of the item over the window size
3     TotalUnit             Total utility of the window
4     TotalTrans            Total number of transactions in the window that contain the item
5     ExternalUtility       External utility of the item
6     TotalProfit           Total profit generated by the item
7     Hot                   Shows whether the item is recent or not



Table 4: Transaction Node Structure

Transaction Node
No.   Field           Description
1     TransNum        Transaction number
2     ItemList        Item list of the transaction
3     TotalItem       Total number of items the transaction contains
4     TransUtility    Total transaction utility



As shown above, the data are stored in these structures. The structures are compact but powerful, as they contain
the item information on which mining is performed.
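
The two node structures can be sketched as Python dataclasses. This is a minimal illustration under stated assumptions:
the paper's implementation is in C#, the Python field names are transliterations of the fields in Tables 3 and 4, and the
explicit `win_size` field is a hypothetical addition used only to size the unit list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RankItemNode:
    name: str                       # Name: item name
    win_size: int                   # assumed helper: number of UnitList slots
    external_utility: float = 0.0   # ExternalUtility of the item
    unit_list: List[int] = field(default_factory=list)  # UnitList[Win_Size]
    total_unit: int = 0             # TotalUnit: total utility in the window
    total_trans: int = 0            # TotalTrans: transactions containing item
    total_profit: float = 0.0       # TotalProfit generated by the item
    hot: bool = False               # Hot: True if the item was seen recently

    def __post_init__(self):
        # Give every window slot an initial utility of zero.
        if not self.unit_list:
            self.unit_list = [0] * self.win_size


@dataclass
class TransactionNode:
    trans_num: int                                       # TransNum
    item_list: List[str] = field(default_factory=list)   # ItemList
    total_item: int = 0                                  # TotalItem
    trans_utility: float = 0.0                           # TransUtility
```

A node for item A in a window of three slots would be created as `RankItemNode("A", win_size=3, external_utility=5)`.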

The user provides only two inputs to the algorithm, 'K' and 'Win_Size'. 'K' defines how many top high utility
itemsets the user wants, and 'Win_Size' defines the window size. Within a window, transactions are organized in batches,
and each batch contains a certain number of transactions.

Below is the flowchart of the algorithm.

[Flowchart of the algorithm. Recoverable node labels: Start, Load Transaction, Transaction Table, Generate Frequent
Pattern, Top-K HUI, User Query, User, Abort Process, End of Data Stream, Stop.]

As seen in the flowchart, data comes in continuously. Using the sliding window model, we first load the data into
the transaction node list and simultaneously add the item information to the RankItem node list. Mining is performed
after each batch, but the result is shown to the user only when the user issues a query or at the end of the data stream.
The reason is that writing results to the screen takes time and memory and is an overhead for the algorithm; if the user
does not want the result right away, there is no need to display it.

The process is simple yet effective. We do not use a tree structure, because the algorithm is based on recent
items, and the items that generate high profit are the most valuable; our structure effectively tracks the items that
generate higher profit in the recent window. Pruning is done at the same time as new transactions arrive.

Items that generate higher profit are sorted in the RankItem list, and potential itemsets are generated from them.

Algorithm 1. Loading Transactions and Data into the Structures
Input: Win_Size, K, transaction T, batch Bi
Output: Top-K HUIs

TxNum ← TNum % Win_Size
For every T ∈ Bi do
    Transaction[TxNum].TransNum = TxNum
    For every item I ∈ T do
        Transaction[TxNum].ItemList.Add(I)
        Transaction[TxNum].TotalItem++
        If item I ∈ RankItem list then
            RankItem.UnitList[TxNum] = I.InternalUtility
            RankItem.ExternalUtility = I.ExternalUtility
            RankItem.Hot = True
        Else
            Create a new RankItem node
            RankItem.Name = I.Name
            RankItem.UnitList[TxNum] = I.InternalUtility
            RankItem.ExternalUtility = I.ExternalUtility
            RankItem.Hot = True
        If RankItem.TotalTrans != Win_Size then
            RankItem.TotalTrans++
        Utility = I.ExternalUtility * I.InternalUtility
        Transaction[TxNum].TransUtility += Utility
If it is not the first sliding window then
    For every RankItem node I in the list do
        If RankItem node I is not hot then
            If RankItem.TotalTrans != 0 then
                RankItem.TotalTrans--
            Else
                Remove RankItem
        Else
            RankItem.Hot = False
For every new batch do
    Call Algorithm 2 (Mining)
Return Top-K HUIs



The first algorithm just adds data to our structures. The RankItem node is a class structure with seven fields, and
its UnitList field has the size of the window. The information of each item is stored in a RankItem object; since a data
stream contains many items, an array of RankItem objects is created.

TxNum is the transaction number, which cannot be greater than the window size. Each transaction is loaded into a
Transaction class structure and then added to the list of transactions.

Each item has an internal and an external utility. Both utilities are stored in the RankItem class structure, which
is held in the RankItem list.

We store and maintain information for the recent window only. When the window moves or a new batch arrives, the
RankItem nodes in the list are updated, and so is the transaction list. The advantage is that the structures remain
compact in size.

In Algorithm 1, for each transaction we add the transaction number to the Transaction node. Every item in the
transaction is added to the transaction's ItemList field. If the item is found in the RankItem list, we update its
information, such as its internal utility and total transaction count, as well as its Hot field, which indicates whether
the item arrived recently.

After the first window, for every transaction we check for items that are not hot. For such an item we decrement
the TotalTrans field of its RankItem node; this way we keep track of recent items. For each item whose Hot field is true,
the field is reset to false.
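
The loading and ageing steps of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch that follows
the pseudocode step by step, not the authors' C# implementation. The function names `load_batch` and `age_rank_items`,
the use of plain dictionaries for the nodes, and the input format of `(name, quantity, external_utility)` tuples are
assumptions made for this illustration.

```python
def load_batch(batch, first_tx_num, win_size, rank_items, transactions):
    """Load one batch of transactions into the two structures.

    batch: list of transactions; each transaction is a list of
           (name, quantity, external_utility) tuples (assumed format).
    rank_items: dict mapping item name -> dict with the RankItem fields.
    transactions: list with win_size slots, overwritten circularly.
    """
    for offset, tx in enumerate(batch):
        slot = (first_tx_num + offset) % win_size
        node = {"TransNum": slot, "ItemList": [], "TotalItem": 0,
                "TransUtility": 0}
        for name, qty, ext in tx:
            node["ItemList"].append(name)
            node["TotalItem"] += 1
            # Reuse the RankItem node if the item is known, else create it.
            item = rank_items.setdefault(name, {
                "UnitList": [0] * win_size, "ExternalUtility": ext,
                "TotalTrans": 0, "Hot": False,
            })
            item["UnitList"][slot] = qty
            item["ExternalUtility"] = ext
            item["Hot"] = True
            if item["TotalTrans"] != win_size:
                item["TotalTrans"] += 1
            node["TransUtility"] += qty * ext
        transactions[slot] = node


def age_rank_items(rank_items):
    """After the first window: age or remove items that are not hot."""
    for name in list(rank_items):
        item = rank_items[name]
        if not item["Hot"]:
            if item["TotalTrans"] != 0:
                item["TotalTrans"] -= 1
            else:
                del rank_items[name]
        else:
            item["Hot"] = False
```

An item that stops appearing is first marked as not hot, then loses one transaction count per ageing pass, and is finally
removed, which keeps the RankItem list limited to items of the recent window.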

Mining is performed after every batch, once the first window is complete.

Algorithm 2 performs the mining, where the actual work is done. For every RankItem node in the list, it updates
TotalUnit by summing all the units in the UnitList field, and then calculates TotalProfit. Min_Supp is taken as the
average of the total profit over all items. The average TotalUnit is also used to filter items. Potential is the list of
RankItem nodes that generate higher profit, sorted in descending order of total profit.

After that, the HUI list is generated by taking each item in turn and adding the second most profitable item, and so
on, through all possible variations; the utility of each itemset is calculated at the same time. This generates itemsets
with very high utility together with their profit. Each itemset is then checked against Min_Supp: if its utility is
greater than Min_Supp, it is added to the topKHui list, which is then sorted by profit.

Algorithm 2. Mining
Input: Batch Bi
Output: Top-K HUIs

If it is not the first sliding window then
    For every new batch do
        topKHui = ∅
        For every RankItem node in the list do
            RankItem.TotalUnit = Σ RankItem.UnitList
            RankItem.TotalProfit = RankItem.TotalTrans * RankItem.TotalUnit * RankItem.ExternalUtility
        Min_Supp = Σ RankItem.TotalProfit / number of RankItem nodes in the list
        AvgTotalUnit = Σ RankItem.TotalUnit / number of RankItem nodes in the list
        Potential = all RankItem nodes where RankItem.TotalProfit > Min_Supp and RankItem.TotalUnit > AvgTotalUnit
        Potential = Potential sorted in descending order of TotalProfit
        HUI list = candidate itemsets generated by taking each item in turn and adding the second most
                   profitable item, and so on, calculating the utility of each itemset at the same time
        For every itemset in the HUI list do
            If utility > Min_Supp then
                Add the itemset to the topKHui list
        Sort the topKHui list in descending order of profit
        If K > size of the topKHui list then
            Display all itemsets with their utility profit
        Else
            Display the top K itemsets and prune the remaining itemsets
Return topKHui

If K is larger than the size of the topKHui list, all itemsets are printed with their utility profit; otherwise
only the top K itemsets are displayed and the remaining itemsets are pruned from the list.
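
The mining step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors'
implementation: the pseudocode does not spell out how candidate itemsets are combined or how their utility is computed,
so the sketch assumes that every combination of potential items is a candidate and that the profit of a candidate is the
sum of the TotalProfit values of its items. The function name `mine_top_k` and the dictionary layout are hypothetical.

```python
from itertools import combinations


def mine_top_k(rank_items, k):
    """Return up to k (itemset, profit) pairs, highest profit first."""
    if not rank_items:
        return []
    for item in rank_items.values():
        item["TotalUnit"] = sum(item["UnitList"])
        item["TotalProfit"] = (
            item["TotalTrans"] * item["TotalUnit"] * item["ExternalUtility"]
        )
    n = len(rank_items)
    min_supp = sum(i["TotalProfit"] for i in rank_items.values()) / n
    avg_unit = sum(i["TotalUnit"] for i in rank_items.values()) / n

    # Potential items: above-average profit and above-average units,
    # sorted in descending order of total profit.
    potential = sorted(
        (name for name, i in rank_items.items()
         if i["TotalProfit"] > min_supp and i["TotalUnit"] > avg_unit),
        key=lambda name: rank_items[name]["TotalProfit"],
        reverse=True,
    )

    # ASSUMPTION: a candidate is any combination of potential items, and its
    # profit is the sum of its items' TotalProfit values.
    candidates = []
    for size in range(1, len(potential) + 1):
        for combo in combinations(potential, size):
            profit = sum(rank_items[name]["TotalProfit"] for name in combo)
            if profit > min_supp:
                candidates.append((combo, profit))

    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]
```

Enumerating all combinations grows exponentially with the number of potential items, so the sketch is only suitable for
small illustrations; the filtering by Min_Supp and AvgTotalUnit is what keeps the list of potential items short.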

THE PERFORMANCE OF ICFP

In this section, the proposed method is evaluated. All algorithms are implemented in C#. The experiments are
conducted on an Intel Core i3 2.53 GHz computer with 3 GB of RAM.

Four datasets are used in the experiments: chess, mushroom, connect4, and T10I4D100K. The first three datasets were
prepared by Roberto Bayardo from the UCI datasets. The last is the IBM synthetic dataset T10I4D100K, where the numbers
after T, I, and D represent the average transaction size, the average size of the maximal potential frequent patterns,
and the number of transactions, respectively. None of the datasets provides the external utility of items or the quantity
of each item in a transaction. Therefore, the external utility of each item is generated using a log-normal distribution,
and the quantity of each item in a transaction is generated randomly between 1 and 10.
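
The generation of the missing utilities can be sketched with Python's standard `random` module. The parameters of the
log-normal distribution are not stated in this paper, so `mu`, `sigma`, and `seed` below are assumptions, as is the
function name `synthesize_utilities`.

```python
import random


def synthesize_utilities(transactions, seed=0, mu=0.0, sigma=1.0):
    """Attach synthetic external utilities and quantities to plain itemsets.

    transactions: list of transactions, each a list of item identifiers.
    Returns (external, with_quantities), where external maps each item to a
    log-normally distributed utility and with_quantities holds
    (item, quantity) pairs with quantities drawn uniformly from 1 to 10.
    """
    rng = random.Random(seed)  # fixed seed (assumed) for reproducibility
    items = sorted({item for tx in transactions for item in tx})
    external = {item: rng.lognormvariate(mu, sigma) for item in items}
    with_quantities = [
        [(item, rng.randint(1, 10)) for item in tx] for tx in transactions
    ]
    return external, with_quantities
```

For example, `synthesize_utilities([[1, 2, 3], [2, 4]])` returns one positive external utility for each of the four
items and a quantity between 1 and 10 for every item occurrence.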



Table 5: Details of Datasets

Dataset       Transactions   Items   Max. objects in trans.
chess         3196           75      37
mushroom      8124           119     23
connect4      67557          129     43
T10I4D100K    100000         870     29



Runtime and performance change as K and WinSize vary. K defines the maximum number of top high utility patterns.
For a small value of K, few itemsets are stored and maintained in the top-K HUI list, so memory consumption is lower,
which improves performance.

WinSize is also important. For a large value of WinSize, high utility itemsets have more potential to generate
profit. For a small value of WinSize, the best result is based only on the transactions in the current window.




Figure 1: Window Size Effect on Runtime (datasets: Chess, Mushroom, IBM, Connect4)

Figure 2: Window Size Effect on Candidate Itemset

As seen in the charts above, a change in WinSize affects the runtime. The runtime is also affected by the number of
items in the RankItem list and the number of potential items: the more potential items there are, the more candidate
itemsets are generated and the more time is taken.

Freeing RankItem nodes whose total transaction count is 0 reduces memory usage and improves performance. In this
algorithm, mining is done for every batch, which also affects performance and runtime.

CONCLUSIONS AND FUTURE WORK 

Our research focuses on designing computation- and memory-efficient algorithms that provide approximate results
with high accuracy and confidence, and on developing system support that helps to mine useful information from data
streams. Some specific research problems have been identified. The results are as expected, and they can be improved by
frequently updating the support value based on the maximum and minimum itemset values in the window. Potential itemset
generation is based on the highest-valued frequent items. The itemsets are for the current window only, so the method can
be improved by maintaining older itemsets and pruning them based on the time-fading model. The new techniques and
algorithms we have developed for frequent itemset mining on streaming data have been presented, and some preliminary
ideas from our ongoing work have been discussed. We are continuing to work on some of these problems.